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Abstract 

The thresholding covariance estimator has nice asymptotic properties for estimating 
sparse large covariance matrices, but it often has negative eigenvalues when used in 
real data analysis. To simultaneously achieve sparsity and positive definiteness, we 
develop a positive definite i\ -penalized covariance estimator for estimating sparse large 
covariance matrices. An efficient alternating direction method is derived to solve the 
challenging optimization problem and its convergence properties are established. Under 
weak regularity conditions, non-asymptotic statistical theory is also established for the 
proposed estimator. The competitive finite-sample performance of our proposal is 
demonstrated by both simulation and real applications. 

Keywords: Alternating direction methods; Large covariance matrices; Matrix norm; Positive- 
definite estimation; Sparsity; Soft-thresholding. 
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1 Introduction 



Estimating covariance matrices is of fundamental importance for an abundance of statisti- 
cal methodologies. Nowadays, the advance of new technologies has brought massive high- 
dimensional data into various research fields, such as fMRI imaging, web mining, bioinfor- 
matics, climate studies and risk management, and so on. The usual sample covariance matrix 



is optimal in the classical setting with large samples and fixed low dimensions (Anderson 



1984), but it performs very poorly in the high-dimensional setting (Johnstone 2001). In 



the recent literature, regularization techniques have been used to improve the sample covari- 



ance matrix estimator, including banding (Wu and Pourahmadi, 2003; Bickel and Levina 



2008a), tapering (Furrer and Bengtsson, 2007 Cai, Zhang, and Zhou, 2010) and thresholding 



(Bickel and Levina, 2008b El Karoui 2008; Rothman, Levina, and Zhu 2009). Banding or 
tapering is very useful when the variables have a natural ordering and off-diagonal entries of 
the target covariance matrix decays to zero as they move away from the diagonal. On the 
other hand, thresholding is proposed for estimating permutation-invariant covariance matri- 
ces. Thresholding can be used to produce consistent covariance matrix estimators when the 
true covariance matrix is bandable (Bickel and Levina [2008b Cai and Zhou, 2011a). In this 
sense, thresholding is more robust than banding/tapering for real applications. 
Let S n 



Rothman, Levina, and Zhu 



(2009) 



{&ij)i<i,j<p be the sample covariance matrix, 
defined the general thresholding covariance matrix estimator as S t/ir = {s\(aij)}i<ij< p , 
where s\(z) is the generalized thresholding function. The generalized thresholding function 
covers a number of commonly used shrinkage procedures, e.g. the hard thresholding s\(z) = 
zl{\ z \>\}, the soft thresholding S\(z) = sign(z)(|;s| — A) + , the smoothly clipped absolute 



deviation thresholding (Fan and Li, 2001) and the adaptive lasso thresholding (Zou, 2006). 
Consistency results and explicit rates of convergence have been obtained for these regularized 
estimators in the literature, e.g. Bickel and Levina (2008a,b), El Karoui (2008), Rothman 



et al. 


(2009 


), 


Cai and Liu 


(2011) 



the minimax rate of convergence under the i\ matrix norm over a fairly wide range of 
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classes of large covariance matrices, where the thresholding estimator is shown to be minimax 
rate optimal. The existing theoretical and empirical results show no clear favoritism to a 
particular thresholding rule. In this paper we focus on the soft-thresholding because it can 
be formulated as the solution of a convex optimization problem. Let || • \\p be the Frobenius 
norm and | • |i be the element- wise £i-norm of all non-diagonal elements. Then the soft- 
thresholding covariance estimator is equal to 

± = argmin hn - ± n \\ 2 F + A|£|i. (1) 

However, there is no guarantee that the thresholding estimator is always positive definite. 
Although the positive definite property is guaranteed in the asymptotic setting with high 
probability, the actual estimator can be an indefinite matrix, especially in real data analysis. 



To illustrate this issue, we consider the Michigan lung cancer gene-expression data (Beer 



et al. , 2002 ) which have 86 tumor samples from patients with lung adenocarcinomas and 5217 



gene expression values for each sample. More details about this dataset are referred to Beer 



et al. (2002) and Subramaniana et al. (2005). We randomly choose p genes (p = 200,500), 
and obtain the soft-thresholding sample correlation matrix for these genes. We repeat the 
process ten times for p = 200 and 500 respectively, and each time the thresholding parameter 
A is selected via the 5-fold cross validation. We found that none of the soft-thresholding 
estimators would become positive definite for both p = 200 and 500. On average, there exist 
22 and 124 negative eigenvalues for the soft-thresholding estimator for p = 200 and p = 500, 
respectively. Figure [I] displays the 30 smallest eigenvalues for p = 200 and the 130 smallest 
eigenvalues for p = 500. 

To deal with the indefiniteness, one possible solution is to utilize the eigen-decomposition 
of S, and project S into the convex cone {S >z 0}. Assume that S has the eigen- 
decomposition S = Yl^=i ^i v i v i> an d then a positive semidefinite estimator S + can be 
obtained by setting S + = Yli=i niax(Aj, 0)vjvi. However, this strategy does not work well 
for sparse covariance matrix estimation, because the projection destroys the sparsity pattern 
of S. Consider the Michigan data again. After semidefinite projection, the soft-thresholding 
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(A) p = 200: the minimal 30 (B) p = 500: the minimal 130 

eigenvalues eigenvalues 
Figure 1: Illustration of the indefinite soft-thresholding estimator in the Michigan lung cancer 
data 

estimator has no zero entry. 

In order to simultaneously achieve sparsity and positive semidefiniteness, a natural solu- 
tion is to add the positive semidefinite constraint to ([!]). Consider the following constrained 
l\ penalization problem 



argmin IIS 



S n ||^/2 + A|S|i. 



(2) 



Note that the solution to ^ could be positive semidefinite. To obtain a positive definite 
covariance estimator, we can consider the positive definite constraint {S >z el} for some 
arbitrarily small e > 0. Then the modified S + is always positive definite. In this work, we 
focus on solving the positive definite S + as follows 



arg mm 



£ n \\ 2 F /2 + 



(3) 



Despite its natural motivation, ^ is actually a very challenging optimization problem 
due to the positive semidefinite constraint. To our best knowledge, the first attempt for 



solving (ph was recently proposed by Rothman (2011) who added the log-determinant barrier 
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function to 



argmin ||S - t n \\ 2 F /2 - rlogdet(S) + A|S| 



i- 



(4) 



where the barrier parameter r is a small positive constant, say 10 4 . From the optimization 
viewpoint, (|4]) is similar to the graphical lasso criterion (Friedman et al. , 2008) which also 



has a log-determinant part and the element- wise £i-penalty. Rothman (2011) derived an 
iterative procedure to solve Q . Rothman (2011 )'s proposal is based on heuristic arguments 
and its convergence property is unknown. 

In this paper we present an alternating direction algorithm for solving ^ directly. Nu- 
merical examples show that our algorithm is much faster than the log-barrier method. We 
further prove the convergence properties of our algorithm and discuss the statistical proper- 
ties of the positive-definite constrained t\ penalized covariance estimator. 



2 Alternating Direction Algorithm 



We use an alternating direction method to solve (|3j) directly. The alternating direction 
method is closely related to the operator-splitting method that has a long history back to 



1950s for solving numerical partial differential equations, see e.g., Douglas and Rachford 



(1956); Peaceman and Rachford (1955). Recently, the alternating direction method has 



been revisited and successfully applied to solving large scale problems arising from different 



applications. For example, Scheinberg, Ma, and Goldfarb (2010) introduced the alternating 



linearization methods to efficiently solve the graphical lasso optimization problem. We refer 
to 



Fortin and Glowinski (1983); Glowinski and Le Tallec (1989) for more details on operator- 



splitting and alternating direction methods. 

In the sequel, we propose an alternating direction method to solve the l\ penalized 
covariance matrix estimation problem ^ under the positive-semi definite constraint. We 
first introduce a new variable and an equality constraint as follows 



(0 ,£ 



ar g mm 



|S - S n |||/2 + A|S|i : £ = 0, >z el}. 



(5) 



The solution to (|5| gives the solution to (|3|. To deal with the equality constraint in (|5]), we 
shall minimize its augmented Lagrangian function for some given penalty parameter /x, i.e. 

L(0, S; A) = ||S - ± n \\ 2 F /2 + A|E|i - (A, - £) + ||0 - S|||/(2/i), (6) 

where A is the Lagrange multiplier. We iteratively solve 

(0 i+1 ,S i+1 ) = argminL(0,£;A ,; ) (7) 

and then update the Lagrangian multiplier A l+1 by 

A i+1 — A i — — S i+1 )//x. 

For we do it by alternatingly minimizing L(0, £; A 1 ) with respect to and S. 
To sum up, the entire algorithm proceeds as follows: 

For i — 0, 1, 2, . . ., solve the following three sub-problems sequentially till convergence 

step : m = arg min L(@, S*; A*) (8) 

©be-*" 

Sstep: S i+1 = argminL(0 i+1 , S; A*) (9) 
A step : A i+1 = A* - (0 m - (10) 

To further simplify the alternating direction algorithm, we derive the closed-form solutions 
for ([8|-(j9j. Consider the step. Define (Z) + as the projection of a matrix Z onto the 
convex cone {0 >z £-T}- Assume that Z has the eigen-decomposition Y^=i ^i v l v ii an d then 
(Z) + can be obtained as Y%=i max(Aj, e)vjv{. Then the step can be analytically solved 
as follows 

l+1 = arg min A 1 ) 



arg min -(A\0) + 110 - £ l ||i/(2//) 
(S i + / iA%. 
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Next, define an entry- wise soft-thresholding rule for all the non-diagonal elements of a 
matrix Z as S(Z,r) = {s(%, T )}i<i,j<p with 

s(zij,r) = sign.(z ij )m.ax(\z ij \ - r,0)I {i -± j} + 

Then the XI step has a closed-form solution given below 

S i+1 = arg minL(0 i+1 , £; A*) 

= argmin ||S - £„|||/2 + A|S|i + (A\£> + ||S - © m |||/(2/i) 
= {5(MS n -A l ) + e t+1 ,A / x)}/(l+/i). 

Algorithm 1 shows the complete details of our alternating direction method for ([3]). In 
Section 4 we provide the convergence analysis of Algorithm 1 and prove that Algorithm 1 
always converges to the optimal solution of ^ from any starting point. 

Algorithm 1 Our alternating direction method for the t\ penalized covariance estimator 

1. Input: /I, S° and A . 

2. Iterative alternating direction augmented Lagrangian step: for the i-th iteration 

2.1 Solve m = (S i + A tA i ) + ; 

2.2 Solve £ m = {S(ji(± n - A*) + A/i)}/(l + //); 

2.3 update A i+1 = A* - - S i+1 )//i. 

3. Repeat the above cycle till convergence. 



In our implementation we use the soft-thresholding estimator as the initial value for both 
@° and S°, and we set A as a zero matrix. The value for ft is 2. Before invoking Algorithm 
1, we always check whether the soft-thresholding estimator is positive definite. If yes, then 
the soft-threhsolding estimator is the final solution to (|3]). 
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3 Numerical Examples 



3.1 Simulation 



Before delving into theoretical analysis of the algorithm and the resulting estimator, we first 
use simulation to show the competitive performance of our proposal. In all examples we 
standardize the variables to have zero mean and unit variance. In each simulation model, 
we generated 100 independent datasets, each with n = 50 independent p-variate random 
vectors from the multivariate normal distribution with mean and covariance matrix So = 
(^ij)i<i,i<p f° r P = 100,200 & 500. We considered two covariance models with different 
sparsity patterns: 

Model 1: 4 = (l-|*-il/10)+. 

Model 2: partition the indices {1, 2, . . . ,p} into K = p/20 non-overlapping subsets of equal 
size, and let ik denote the maximum index in 

K K-l 

a% = 0.61 {i=j} + 0.4^/ {ie4iie4} + 0A^2(I {i=ikJeIk+l} + I{iei k+1 j=i k }) 



k=l 



k=l 



Model 1 has been used in Bickel and Levina (2008a) and Cai and Liu (2011), and Model 2 



is similar to the overlapping block diagonal design used in Rothman (2011). 



First, we compare the run times of our estimator S with the log-barrier estimator X 
by Rothman (2011). As shown in Table [TJ our method is much faster than the log-barrier 
method. 

In what follows, we compare the performance of S + , S + and the soft-thresholding es- 
timator S. For all three regularized estimators, the thresholding parameter was chosen by 



5-fold cross-validation (Bickel and Levina 2008b; Rothman et al. 2009; Cai and Liu, 2011). 



The estimation performance is measured by the average losses under both the Frobenius 
norm and the spectral norm. The selection performance is examined by the false positive 
rate 

^0fe^ = Q} 



Table 1: Total time (in seconds) for computing a solution path with 99 thresholding param- 
eters A = {0.01, 0.02, • • • , 0.99}. Timing was carried out on an AMD 2.8GHz processor. 







Model 1 




Model 2 


V 


100 


200 500 


100 


200 500 


Our method 


9.2 


65.2 1156.0 


7.5 


51.1 986.6 


Rothman's method 


84.1 


822.1 35911.8 


51.3 


611.1 32803.0 



and the true positive rate 

&jj ± & ay 0} 

Moreover, we compare the average number of negative eigenvalues and the percentage of 
positive-defmiteness to check the positive-defmiteness. 

Tableland Table [3] show the average metrics over 100 replications. The soft-thresholding 
estimator X is positive definite in 19 or fewer out of 100 simulation runs, while S + and S + 
can always guarantee a positive-definite estimator. The larger the dimension, the less likely 
for the soft-thresholding estimator to be positive definite. In terms of estimation, both S + 
and S + are more accurate than S. As for the selection performance, S + and S + achieve a 
slightly better true positive rate than S. Overall, S + is the best among all three regularized 
estimators. 



3.2 Real data 



To demonstrate our proposal we further consider two gene expression datasets: one from a 



small round blue-cell tumors microarray experiment (Khan et al. 2001) and the other one 



from a cardiovascular microarray study (Efron, 2009, 2010). The first dataset has 64 training 



tissue samples with four types of tumors (23 EWS, 8 BL-NHL, 12 NB, and 21 RMS), and 
6567 gene expression values for each sample. We applied the pre-filtering step used in Khan 
et al. (2001) and then picked the top 40 and bottom 160 genes based on the F-statistic 
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Table 2: Comparison of the three regularized estimators for Model 1. Each metric is averaged 
over 100 replications with the standard error shown in the bracket. NA means that the results 
for £ + (Rothman's method) are not available due to the extremely long run times. 





Frobenius 


Spectral 


False 


True 


Negative 


Positive 




norm 


norm 


positive 


positive 


eigenvalues 


denniteness 








P 


= 100 






soft 


8.41 


4.02 


24.5 


87.6 


2.24 


53/100 


thresholding 


(0.06) 


(0.04) 


(0.1) 


(0.0) 


(0.14) 


our 
method 


8.40 
(0.06) 


4.02 
(0.04) 


24.8 
(0.1) 


87.8 
(0.0) 


0.00 
(0.00) 


100/100 


Rothman's 
method 


8.40 
(0.06) 


4.02 
(0.04) 


24.5 
(0.1) 


87.7 
(0.0) 


0.00 
(0.00) 


100/100 








P 


= 200 






soft 


13.82 


4.70 


14.3 


83.2 


3.74 


23/100 


thresholding 


(0.06) 


(0.03) 


(0.4) 


(0.3) 


(0.22) 


our 
method 


13.80 
(0.06) 


4.69 
(0.03) 


14.6 
(0.4) 


83.5 
(0.3) 


0.00 
(0.00) 


100/100 


Rothman's 
method 


13.81 
(0.05) 


4.69 
(0.03) 


14.6 
(0.4) 


83.5 
(0.3) 


0.00 
(0.00) 


100/100 








P 


= 500 






soft 
thresholding 


25.15 
(0.11) 


5.28 
(0.04) 


6.3 
(0.2) 


78.1 
(0.3) 


4.64 
(0.60) 


7/100 


our 
method 


25.10 
(0.11) 


5.28 
(0.04) 


6.5 
(0.2) 


78.3 
(0.3) 


0.00 
(0.00) 


100/100 


Rothman's 
method 


NA 
NA 


NA 
NA 


NA 
NA 


NA 
NA 


NA 
NA 


NA 
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Table 3: Comparison of the three regularized estimators for Model 2. Each metric is averaged 
over 100 replications with the standard error shown in the bracket. NA means that the results 
for £ + (Rothman's method) are not available due to the extremely long run times. 





Frobenius 


Spectral 


False 


Irue 


Negative 


Positive 




norm 


norm 


positive 


positive 


eigenvalues 


definiteness 








P 


= 100 






soft 


9.81 


4.87 


29.5 


97.2 


1.54 


19/100 


thresholding 


(0.07) 


(0.05) 


(0.0) 


(0.0) 


(0.14) 


our 
method 


n tq 

y. to 
(0.07) 


A QK 

4.o0 
(0.05) 


oU.z 
(0.0) 


y ( .6 
(0.0) 


n nn 
U.UU 

(0.00) 


100/100 


Rothman's 
method 


y. to 
(0.07) 


A QK 

(0.05) 


oU.U 
(0.0) 


y 7 .6 
(0.0) 


0.00 

(0.00) 


100/100 








p 


= 200 






soft 


15.95 


5.90 


17.1 


94.1 


3.93 


7/100 


j_l 1 11* 

thresholding 


(0.12) 


(0.06) 


(0.4) 


(0.3) 


(0.27) 


our 
method 


10. ol 
(0.12) 


0.84 
(0.06) 


10. 

(0.3) 


yo.u 
(0.3) 


n nn 
U.UU 

(0.00) 


100/100 


Rothman's 
method 


1 £ CO 

15. oo 
(0.12) 


0.85 

(0.06) 


1 o o 
lo.o 

(0.4) 


94.0 
(0.3) 


n nn 
U.UO 

(0.00) 


100/100 








P 


— ouu 






soft 


29.46 


6.92 


7.6 


87.7 


3.84 


4/100 


thresholding 


(0.18) 


(0.07) 


(0.1) 


(0.5) 


(0.78) 


our 
method 


29.17 
(0.20) 


6.84 
(0.06) 


8.7 
(0.2) 


88.8 
(0.6) 


0.00 
(0.00) 


100/100 


Rothman's 
method 


NA 
NA 


NA 
NA 


NA 
NA 


NA 
NA 


NA 
NA 


NA 
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as done in Rothman et al. (2009). The second dataset has 63 subjects with 44 healthy 
controls and 19 cardiovascular patients, and 20426 genes measured for each subject. We 
used the F-statistic to pick the top 50 and bottom 150 genes. By doing so, it is expected 
that there is weak dependence between the top and the bottom genes. We considered the 



soft-thresholding estimator (Bickel and Levina, 2008b), the log-barrier estimator (Rothman 



2011) and our estimator. For all three estimators, the thresholding parameter was chosen 



by 5-fold cross validation. 




150 160 170 180 190 200 150 160 170 180 190 200 

Index Index 



(A) (B) 
Figure 2: Plots of the bottom 50 eigenvalues of all three regularized estimators for the small 
round blue-cell data (A) and the cardiovascular data (B): S (solid), S + (dashed) and S + 
(dotted). 

As evidenced in Plot |2j the soft-thresholding estimator yields an indefinite matrix for 
both real examples whereas the other two regularized estimators guarantee the positive- 
definiteness. The soft-thresholding estimator contains 37 negative eigenvalues in the small 
round blue-cell data, and 46 negative eigenvalues in the cardiovascular data. Regularized 
correlation matrix estimation has a natural application in clustering when the dissimilarity 
measure is constructed using the correlation among features. For both datasets we did 
hierarchical clustering using the three regularized estimators. The heat maps are shown in 
Figure [3] in which the estimated sparsity pattern well matches the expected sparsity pattern. 

Finally, we compared the average run times over 5 cross validations for both S + and S + , 
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(Al) 



(A2) 



(A3) 




(Bl) (B2) (B3) 

Figure 3: Heat maps of the absolute values of three regularized sample correlation matrix 
estimator for the small round blue-cell data (A) and the cardiovascular data (B): S (Al, 
Bl), S + (A2, B2) and S + (A3, B3). The genes are ordered by hierarchical clustering using 
the estimated correlations. 

Table 4: Total time (in seconds) for computing a solution path with 99 thesholding param- 
eters. Timing was carried out on an AMD 2.8GHz processor. 





Blue cell data 


Cardiovascular data 


Our method 


74.7 


66.3 


Rothman's method 


1302.7 


1575.3 
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as shown in Table |4} It is obvious that our proposal is much more efficient. 
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4 Theoretical properties 

4.1 Convergence analysis of the algorithm 

In this section, we prove that the sequence (@\ S l , A 1 ) produced by the alternating direction 
method (Algorithm 1) converges to (@ + , S + , A + ), where (0 + , S + ) is an optimal solution of 
(|5| and A + is the optimal dual variable. This automatically implies that Algorithm 1 gives 
an optimal solution of (j3]). 

We define some necessary notation for ease of presentation. Let G be a 2p by 2p matrix 
defined as 

(l/n)l pxpj 

Define the norm || • || G as \\U\\q = (U,GU) and the corresponding inner product (•, -)g 
as (U,V)c = (U,GV). Before we give the main theorem about the global convergence of 
Algorithm 1, we need the following lemma. 

Lemma 1. Assume that (@ + , S + ) is an optimal solution of (J5]) and A + is the corresponding 
optimal dual variable associated with the equality constraint S = 0. Then the sequence 
{(©', I]*, A 1 )} produced by Algorithm 1 satisfies 

\\Tji _ [/-Ha _ ||£7*+1 _ £7*1^ > ||£7i _ £7 i+1 ||^, (11) 

where U* = (A + , S + ) T and U { = (A*, S^. 

Now we are ready to give the main convergence result of Algorithm 1. 

Theorem 1. The sequence {(0\ S l , A')} produced by Algorithm 1 from any starting point 
converges to an optimal solution of 
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4.2 Statistical analysis of the estimator 

Define S° as the true covariance matrix for the observations X = (Xij) nxp , and define the 
active set of S° = (cTj k )i<j,k< P as A = {(j, k) : <r° fe 7^ 0, j 7^ A;} with the cardinality s = \A \. 
Denote by B Ao the Hadamard product B pxp o (/{(j,fc)eA })i<j,fc<p = (b jk ■ I{(j,k)eA })i<j,k< P - 
Define a max = maxj crj- as the maximal true variance in S°. 

Theorem 2. Assume that the true covariance matrix S° is positive definite. 

(i) Under the exponential-tail condition that for all \t\ < r] and 1 < i < n,l < j < p 

£{exp(fXj)} < K u 

we also assume that \ogp < n. For any M > 0, we pick the thresholding parameter as 

2 \ogp flogp^ 1:2 
A = c h ci 



n \ n 

where 



= leK^ + rj-^iM + l) 

and 



C ° 2 



ci = 2K 1 (t ? - 1 + ^max) exp(^a max ) + 2r ? - 1 (M + 2). 
W^/i probability at least 1 — 3p~ M , we have 

-X°\\ F <5\(s + p) 1/2 . 

(zzj Under the polynomial-tail condition that for all 7 > 0, e > Oand 1 < i < n,l < j < p 

E{\X t3 f 1+ ^} < K 2) 

we also assume that p < cn< for some c > 0. For any M > 0, we pick the thresholding 
parameter as 

A = 8(K 2 + 1)(M + l) 1 ^ + 8(K 2 + 1)(M + 2) ^ l ° g/J ' 



n \ n 

With probability at least 1 - 0(p- M ) - 3X 2 p(logn) 2(1+7+£) n 7 6 , we have 

||S + -S°|| F <5A(s + p) 1/2 . 
15 



Define d = max, J2k I{°~ 3 k¥=o} an< ^ assume that cr max is bounded by a fixed constant, then 
we can pick A = 0((\ogp/n) l l 2 ) to achieve the minimax optimal rate of convergence under 



the Frobenius norm as in Theorem 4 of Cai and Zhou (2011b) that 

logp 



1 

V 



Or, (1 + 



£ logp 

p n 



O p Id- 



However, to attain the same rate in the presence of the log-determinant barrier term, Roth- 



man 



(2011) instead would require that cr m ; n , the minimal eigenvalue of the true covariance 



matrix, should be bounded away from zero by some positive constant, and also that the 
barrier parameter should be bounded by some positive quantity We would like to point 
out that if <r min is bounded away from zero, then the soft-thresholding estimator S si will be 



positive-definite with an overwhelming probability tending to 1, (Bickel and Levina, 2008b 



Cai and Zhou, 2011a[b ). Therefore the theory requiring a lower bound on <7 m i n is not very 
appealing. 



5 Conclusions 

The soft-thresholding estimator has been shown to enjoy good asymptotic properties for esti- 
mating large sparse covariance matrices. But its positive definiteness property can be easily 
violated, which means the soft-thresholding estimator could be in principle an inadmissible 
estimator for covariance matrices. In this paper we have put the soft-thresholding estimator 
in a convex optimization framework and considered a natural modification by imposing the 
positive definiteness constraint. We have developed a fast alternating direction method to 
solve the constrained optimization problem and the resulting estimator retains the sparsity 
and positive definiteness properties simultaneously. The algorithm and the new estimator 
are supported by numerical and theoretical results. 
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Appendix: Technical Proofs 

Proof of Lemma\^ Since (@ + , S + , A + ) is optimal to (|5j), it follows from the KKT conditions 
that the followings hold. 

(-A + - S + + t n ) je /X G d\tj t \, Vj = l,...,p,£=l,...,pfmdj?£, (12) 

(S + - t n ) 33 + Aj = 0, Vj = l,...,p, (13) 

+ = S + , (14) 

€) + h el, (15) 

and 

(A + ,e-e + )<o, ve^e/. (16) 

Note that the optimality conditions for the first subproblem in Algorithm 1, i.e. the 
subproblem with respect to in (fSh , are given by 



(A 1 - (e m - s 4 )/^, e - e i+1 ) < o, ve y ei. (17) 

Using the updating formula for A* in Algorithm 1, i.e., 

A l +! = A* - (£ m - m )//i, (18) 



(17) can be rewritten as 

(A i+1 - (s i+1 - s 4 )/^, e - e i+1 ) < o, ve y ei. (19) 
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Now by letting © = 0* +1 in (|16l) and © = © in (|19|), we can get that 



+ . 



(A ' , - ' ) < 0, 



(20) 



and 



(A i+1 - (£ m - S l ) ///, ' - < 



i+l\ 



(21) 



Summing (20) and (21) yields 



m - © , (A* +i - A ') + (£* - S^ 1 )/^) > 



. i+l 



(22) 



The optimality conditions for the second subproblem in Algorithm 1, i.e., the subproblem 
with respect to £ in (|8| are given by 



G (S i+1 -S n ^+A9|S;| 1 |+A},+(S l+1 -@' ;+1 )^, Vj = 1, . . . ,p,i= 1, . . . ,p, and j ± £, 

(23) 

S n )^ + A}, + (S i+1 -e i+1 ) ii /^ = > Vj = l,...,p. (24) 



and 



i+l 



Note that by using (18), (23) and (24) can be respectively rewritten as 



_ A i+i _ E i+i + ± n }. i / X e d\V$ l \, Vj = 1, . . . ,p,£ = 1, . . . ,p, and j ^ e, (25) 



and 



f^i+l v> v. i A *+l 



^nh • A':' 0. V/ 1 p. 



(26) 



Using the fact that d\ ■ | is a monotone function, (12), (13), (25) and (26) imply 



(£<+i - £ ' , (A - A i+1 ) + (£ ' - £ J+i )) > 0. 



i+is 



(27) 



The summation of (22) and (27) gives 



- £ ' , A ' - A ?+i ) + (0 ' - l+i , A ' - A l+i ) 



i+l 



.*+l\ 



+ (0 ' - i+1 , £ m - £*) In > ||£ 



i+l V 112 
- 2j \\ f . 



(2f 
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Combining p8ft with m = fi(A l - A l+1 ) + S ?+i and ' = E ' leads to 



_ A _ A i+1 ) + (E - E m - fi(A l - A t+1 ), A - A l+1 ) 
+ (E + - E m - fi{A l - A l+1 ), E m - S J )//i > 



Simple algebraic derivation from (29) yields the following inequality: 



fj,(A - A , A 1 - A l+i ) + (E J+i - E , E l - E J+i )/i 
> ||E l+1 - t + \\ 2 F - (A* - A i+1 , E' - E i+1 ). 



Rearranging the terms on the left hand side of (30 ) using © — = (0 — l ) + (0 l 
and E + - E i+1 = (E + - £*) + (£* - £ m ), then d28b can be reduced to 



/i(A l - A , A* - A i+1 ) + (E* - £ , £ l - E m )//i 
> ju||A' — A i+1 || F + ||£* — £ i+1 || F /// + ||£ i+1 — E + ||| — (A' — A i+1 , E* — E i+1 ). 



Using the notation of U l and [/*, (31) can be rewritten as 



(U l - U*, U* - U i+1 ) G >\\u l -u 



Tl+l II 2 



ii v**+i v 1 II z 

G -T \\2j — 2j || f 



- (A* — A i+1 , £* — £ i+1 ) 



Combining (32) with the following identity 



\\U l+1 - U*\\% = \\U i+1 - U*\\% - 2(U k - U i+1 , U l - U*) G + \\IP - U* 



\Gi 



we get 



\\u*-w\\% -\\v 



Ti+l JJ*\\2 



\G 



2(U % -U i+ \U l -C/*}- || U 



i+1 _ jji\\2 



\G 



> 2\\U l - U l+1 \\ 2 G + 2||E i+1 - E || 2 - 2(A l - A i+1 , E i - E i+1 ) - \\U i+1 - 17% 
= WU* - U i+1 \\ 2 G + 2 ll si+1 - S + || 2 - 2(A* - A i+1 , E l - E i+1 ). 



Now, using (25) and (26) for % instead of i + 1, we get, 



and 



-A' - £< + V n )jt/\ e d\i:U, Vj = 1, . . . ,p,£= 1, . . . ,p, and j ± I, 



£* - ± n ) n + A!,- = 0, Vj = l,...,p. 
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Combining (25), (26), (34), (35) and using the fact that d\ • | is a monotone function, we 
obtain, 

(£* - £ i+1 , A i+1 - A* + £ i+1 - E*) > 0, 



which immediately implies, 



(E 4 - E i+1 , A i+1 — A*) > ||E J+1 -E l ||^>0. 



(36) 



By substituting (36) into (33), we get the desired result (11) 



□ 



Proof of Theorem [7| From Lemma [T] we can easily get that 



(i) \\W - U i+1 \\ G -> 0; 



(ii) {U 1 } lies in a compact region; 



iii) ||{7 l — U*\\q is monotonically non-increasing and thus converges. 



It follows from (i) that A*-A i+1 -»■ and £*-£ l+i -»■ 0. Then <fiM implies that l -0 i+i 



i+l 



and l — £* — >■ 0. From (ii) we obtain that, U l has a subsequence {U lj } that converges to 
17 = (A, £), i.e., A^ -> A and £ ij -> E. From 0* - E i -> we also get that 0* ->■ := E. 
Therefore, (0, E, A) is a limit point of {(0\ £\ A*)}. 



Note that (|25[) and (24) respectively imply that 



;-A - E + E n )^/A G Vj = 1, . . .,p,£ = 1, . . . ,p, and j ^ 



and 



+ A« = 0, Vj = l,...,p, 



(37) 



(38) 



and (19) implies that 



(A, - 0) < 0, V0 b e/. (39) 
(37), (38) and (39) together with = £ mean that (0, E, A) is an optimal solution to (j5]). 



Therefore, we showed that any limit point of {(0\ £\ A*)} is an optimal solution to ([5]). □ 
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Proof of Theorem^ Without loss of generality, we may always assume that E(Xij) = for 
all 1 < i < n, 1 < j < p. By the condition that S° is positive definite, we can always 
choose some very small e > such that e is smaller than the minimal eigenvalue of S°. We 
introduce A = E — S°, and then we can write ^ in terms of A as follows, 

1 



arg mm - 1| A + S 

A: A=A T ,A+S°^e7 2 



EJ^ + AIA + E 1 ^ (=F(A)). 



Note that it is easy to see that A = S — S°. 

Now we consider A G {A : A = A T , A + S° >z eI,\\A\\ F = 5As 1/2 }. Under the 
probability event {|cr™- — <j° | < A, V(z, j)}, we have 



F(A) - F(0) 



> 



> 



> 



A + S c 



T II 2 - 



1 



J n\\F 



+ AIA + S !! - AIS !! 



A|||+ < A, S° - t n > +A|A A(C J 1 + A(|A Ao + £°J 



ivO I N 



All 2 



A(|A|i + \ A a\) + A l A ^li - A l A ^ol 

i 

2A(|Aao|i + ^|A«|) 

i 

A\\ 2 F -2X(s + p) 1/2 \\A\\ F 



A||| 



> ^ 2 (* + P) 

> 



Note that A is also the optimal solution to the following convex optimization problem 



arg 



mm 



F(A)-F(0) (=G(A)). 



A:A=A T ,A+S°^ei" 

Under the same probability event, ||A||^ < 5A(s + p) 1 ^ 2 would always hold. Otherwise, the 

1 /2 

fact that G(A) > for ||A||_p = 5A(s + p) should contradict with the convexity of G(-) 
and G(A) < G(0) = 0. Therefore, we can obtain the following probability bound 

Pr(||S + - S°|| F < 5\{s + p) 1/2 ) > 1 -Pr(max|a" - <r°,| > A). 



Now we shall prove the probability bound under the exponential-tail condition. First it 
is easy to verify two simple inequalities that 1 + u < exp(u) < 1 + u + ^M 2 exp(|u|) and 
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f 2 exp(|f I) < exp(t> 2 + 1). The first inequality can be proved by using the Taylor expansion, 
and the second one can be easily derived using the obvious facts that exp(v 2 + 1) > exp(2|i>|) 
and exp(|t>|) > v 2 . 

Let t = fa^) 1/2 , c = \eK x tfl 2 + rf x l 2 {M + 1) and e = c (^) 1/2 . For any M > 0, 
we can apply the Markov inequality to obtain that 

n 

Pr(^ Xij > ne ) < exp(-t ne ) • E[exp(t X i:j )] 

i i=l 

71 ( t 2 1 

< exp(-t neo) J] 1 1 + j E l X i exp(* |^-|)] > 

i=i ^ J 

< p" c ^' 1/2 • exp(| E l X i ex P(^l)D 



i 

< p-co^ a . eX p(| £ £[exp(t 2 X 2 + 1)]) 

i=i 

< p" C0T ' 1/2 •exp(^e^ 1 r7logp) (=p- M_1 ), 



where we apply exp(w) < 1 + u + ^m 2 exp(|-u|) in the second inequality and 1 + u < exp(-u) 
in the third inequality, and then use t> 2 exp(|t>|) < exp(t> 2 + 1) in the fourth inequality. 
Moreover, the simple facts that E[Xij] — (1 < i < n) and tl = < f] are also used. 

Let h = IvC^ 2 ) 112 and Cl = 2K 1 (r ] - 1 + l W 2 m J exp(^ ( x max ) + 2if\M + 2). Define 
£i = Ci(^p) 1 / 2 - For any M > 0, we first apply the Cauchy inequality to obtain that 

E[X 2 X 2 k ■ eM\v\X tJ X lk \)] < E[X 2 X^-exp(^(X 2 + X 2 k ))] 

< {E[Xf 3 exp(^X 2 /2)])V 2 . (E[X? k exp( V X 2 k /2)]f 2 

< A V - 2 ■ (E[eMvXm i/2 ■ (E[eMvXm i/2 

< 4K lV ~ 2 , 

where we use the simple inequality exp(|t> |) > v 2 in the third inequality. Then, combining 
this result with the Cauchy inequality again yields that 

E[(X l3 X lk - a° jk ) 2 ■ eMti\XijX ik - a° jkl 
1 

2' 



< 2E\X 2 X 2 k ■ eMUXijXik - a° jk \)] + 2(a%) 2 • £[exp(^|^JQ fe - a%\ 
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< 8K lV - 2 ■ eM\w%) + 2(a%)' • exp(^4) • £[exp(^(Xg + 

< ST^ 2 • exp(^<7 max ) + 2^ • exp(^ max ) (£[exp(^X 2 )]) 1/2 • (E[exp(^X^,)]) 1/2 

< 2K X (Arj- 2 + cx 2 iax ) • exp( ^cr max 



'2 

where we use the fact that t\ = IvC^fr) 1 ^ 2 < \r\ < V in the first inequality, and then use 
|c°fc| < ( at jj a kkY^ 2 — ff max in the third inequality. Now, we can apply the Markov inequality 
to obtain the following probability bound 

Pr(J2{XijXik - 4} > ne 1 ) 

i 

n 

< exp(-iinei) • JJ E [exp^X^*. - <j° fe ))] 

i=i 

< p -§^ . JJ |i + l - t \ ■ E [{X l3 X lk - a%) 2 ■ eMti\XijX ik - a%\)} } 

< P^ C1V ■ exp [h\ .J^E [(XijX* - o%) 2 ■ exp^X^ - a%\)] j 

< p-^ Cl " • exp (k^I + ^VjLj • exp(^ ( r max ) • logp) (= p^" 2 ), 

where we apply exp(-u) < 1 + u + ^w 2 exp(|w|) and E[XijX ik ] = cr® k for i — 1, 2, • • • , n in the 
second inequality, and we use 1 + u < exp(-u) in the third inequality. 
Recall that A = c ^p + Ci(^) 1/2 = e 2 Q + e x and 

°% ~ o)k = Yl X H X ik ~ a %) ~ Y X ^ ' (~ Y Xik ^ 

i i i 

Therefore, we can complete the probability bound under the exponential-tail condition as 
follows 

Pr(max \frj k - a° jk \ > A) < p 2 Pr(^ X l3 X lk > n(a% + e 1 )) + 2pPr(^ X tJ > ne ) 

i i 

< 3p- M . 

In the sequel we shall prove the probability bound under the polymonial-tail condition. 
First, we define c 2 = 8{K 2 + 1)(M + 1) and e 2 = c 2 (^) 1 / 2 . Define 5 n = n l /\\ogn)- l l 2 , 
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Yij = Xijl^x^KSn} and Z {j = X^I^x^*}- Then we have X^ = Y, tj + Z {j and E\X^ = 
E[Yij] + E[Zij]. By construction, \Yij\ < 5 n are bounded random variables, and E[Zij] are 
bounded by o(e 2 ) due to the fact that < 5~ 3 £[|Xy| 4 /{| Xij | >(5n }] < K 2 5~ 3 = o(e 2 ). 

Now we can apply the Bernstein's inequality (Bernstein 1946 Bennett , |1962 ) to obtain that 



P r(£0%-™>i- 2 ) < exp( w( -^ ) 



< 



exp 



3 

-c 2 logp 



,8#2 + 8 + 0(n- x /4) 

where the fact that variY^) < E[X$ < £[X§/ { | X ..|> 1} ] + S[Xj./ {|XiJ [< 1} ] < K 2 + 1 is used 
in the second inequality. Besides, we can apply the Markov inequality to obtain that 

PT(\Xij\ > 5 n ) < 5- 4{1+ " /+£) E [\X ij \^ 1+ ' r+ ^] < K 2 (logn) 2(1+7+e) n- 1 -' Y - £ . 

Then, we can derive the following probability bound 

Pr(J2 Xij > ne 2 ) = Pr(^{F ij + % - E\Y^ + %]} > ne 2 ) 

i i 

< Pr(X){yy - } > \ne 2 ) + Pr(^{% - E[Z tJ }} > \ne 2 ) 

i i 

< 0{p- M - 1 ) + Pr(]T{% - o(e 2 )} > \ne 2 ) 

% 



< O^-^ + ^Clogn) 2 ^ 



Let c 3 = 8(^2 + 1)(M + 2) and e 3 = c 3 (^ £ ) 1 / 2 . Recall that 5 n = (^y) 1/4 , and de- 
fine % fc = XijX ik I^ Xii \>& n or |x ife |>5„}- Tnen we nave -^y^* = + and a-° fc = 
E[XijXik] = E[YijYik] + E[Rijk]. By construction, |YijYjfc| < <5 2 are bounded random vari- 
ables, and E[Rijk] is bounded by o(e 3 ) due to the fact that 

\E{Rijk}\ < lElXijXikl^Xi^Sn})] + \ E [XijX ih I { \x ik \ >Sn }\\ 

< S- 2 - 4l E[xf +l) h\^\>s n }\ ■ E[Xl\ + ^ 2 - 47 ^[^i (1+7) /{|x lfc |>,, l} ] • E[Xl] 

< 2K 2 5~^ (=o(s 3 )). 
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Again, we can apply the Bernstein's inequality to obtain that 



Pr(^{n^ - E[YijY ik ]} > l -nez) < exp ( 



< exp 



8K 2 + 8 + ffe 

-c 3 logp 

8iT 2 + 8 + 0((logn)- 1 /2) 

o( P - M ~ 2 ), 



where the fact that var^Y*) < E[XfjXf k ] < (E[X^}E[Xf k }y/ 2 < K 2 + 1 is used. 
Pr(max | V^X^X^ - a° jk )\ > ne 3 ) 

I 

< Pr(max | V^Y^ - EfY^Yy }| > \ne 3 ) + Pr(max | - E[R ijk ]}\ > \ne 3 ) 

j,k * — ' Z j,k * — ' Z 

i i 

< 2 V Pr( V{^y ife - E[Y i3 Y lk }} > \ne 3 ) + Pr(max | J2{R ljk - o(e 3 )}\ > \ne 3 ) 

*■ — ' — ' Z j,k * — ' Z 

j,k i i 

< o( P ~ M ) + J2^(\x^\>s n ) 

< 0(p~ M ) + K 2 p(\ogn)^ 1+ ^n-^- £ 



Recall that A = c 2 ^p + c 3 (^ £ ) 1 / 2 = e\ + e 3 . Therefore, we can prove the desired 
probability bound under the polynomial-tail condition as follows 



Pr(max|o£-<7y fc | > A) 

j,k 



< Pr(max | y~]{XijX ik - <?° jk }\ > ne 3 ) + Pr(max | ^ X^\ > ne 2 ) 

% I 

< 0(p~ M ) + 3K 2 p(\og nf^+^n-^. 



□ 
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